Abstract

There is an increasing body of evidence suggesting that exact nearest
neighbour search in high-dimensional spaces is affected by the curse of
dimensionality at a fundamental level. Does it necessarily mean that the same
is true for k nearest neighbours based learning algorithms such as the k-NN
classifier? We analyse this question at a number of levels and show that the
answer is different at each of them. As our first main observation, we show
the consistency of a k approximate nearest neighbour classifier. However,
the performance of the classifier in very high dimensions is provably
unstable. As our second main observation, we point out that the existing
model for statistical learning is oblivious of dimension of the domain and so
every learning problem admits a universally consistent deterministic
reduction to the one-dimensional case by means of a Borel isomorphism.

1. Introduction

Local learning algorithms such as k-NN classification or k-NN regression
occupy an important place in statistical learning theory, further enhanced by
a surprising recent result [1] stating that every consistent learning
algorithm $\mathcal{L}$ in the Euclidean space is "localizable" in a suitable
sense. Suppose that we show to $\mathcal{L}$ only data in the $r$-ball around
each point $x$,

$$x \mapsto \mathcal{L}(B_r(x) \cap \sigma)(x) = f(x).$$

Figure 1: Towards the notion of localizability.



Now smooth the resulting predictor function $f$ over the $q$-neighbourhood
of each $x$:

$$x \mapsto E(f \mid B_q(x)) = g(x).$$

Zakai and Ritov have shown that if $r, q \downarrow 0$ sufficiently slowly,
then the resulting "localized" predictor $g$ is consistent.

It is therefore of obvious interest to examine the question of performance
of local learning algorithms in high-dimensional domains. Are they provably
affected by the curse of dimensionality? In this article, we will concentrate
on the classical k-NN classifier. The question turns out to be many-layered,
and the answer is different at every layer that we peel back.

Perhaps the most basic consideration is that in order to run the k-NN
classifier, one needs to be able to efficiently retrieve k nearest neighbours
to every input point of the domain. For smaller datasets, this problem is
solved by means of a complete sequential scan of data. However, for larger
datasets this becomes impracticable, and considerable efforts of the data
engineering community go towards designing various indexing schemes assuring
faster similarity search [2I, [sj.

In spite of all the progress in the area, there is a considerable body of
evidence in support of the so-called curse of dimensionality conjecture [3]
affecting the exact deterministic nearest neighbour search in
high-dimensional spaces. Recall that the Hamming cube of rank $d$,
$\{0,1\}^d$, is the collection of all $d$-bit binary strings equipped with
the Hamming distance counting the number of bits where two strings differ:

$$d(\sigma, \tau) = \sharp\{i \colon \sigma_i \neq \tau_i\}.$$
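As a small illustration of the definition above, here is a minimal Python sketch of the Hamming distance. The function name `hamming_distance` is ours and purely illustrative; bit strings may be given as Python strings or as sequences of 0s and 1s.

```python
def hamming_distance(sigma, tau):
    """Number of coordinates at which two equal-length bit strings differ."""
    if len(sigma) != len(tau):
        raise ValueError("both strings must have the same length d")
    return sum(s != t for s, t in zip(sigma, tau))


# The strings differ in the first and the last coordinate only.
print(hamming_distance("10110", "00111"))  # 2
```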

Recall also that $\omega(1)$ denotes the class of integer sequences $n_d$
that go to infinity, $O(1)$ denotes the class of all bounded sequences, and
$o(d)$ denotes the class of integer sequences $n_d$ that are infinitely small
with regard to $d$, that is, $n_d/d \to 0$ as $d \to \infty$. For example,
the notation $n \in d^{\omega(1)}$ means that $n$ grows faster than any
finite power of $d$, while $n \in 2^{o(d)}$ means $n$ grows slower than the
$d$-th power of any number $a > 1$.
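As a worked example of this notation (our own illustration, not taken from the text), the dataset size $n = 2^{\sqrt{d}}$ has intermediate growth in both senses at once:

```latex
n = 2^{\sqrt{d}} = d^{\sqrt{d}/\log_2 d},
\qquad \frac{\sqrt{d}}{\log_2 d} \to \infty
\quad\text{and}\quad \frac{\sqrt{d}}{d} \to 0
\quad (d \to \infty),
\qquad\text{hence}\qquad
n \in d^{\omega(1)} \cap 2^{o(d)}.
```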

In its simplest form, the conjecture states that a dataset $X$ with $n$
points in the $d$-dimensional Hamming cube $\{0,1\}^d$, where
$n \in d^{\omega(1)} \cap 2^{o(d)}$, does not in general admit a data
structure of size $n^{O(1)}$ (that is, polynomial in $n$) which supports
exact deterministic similarity search in $X$ in time $d^{O(1)}$ (i.e.,
polynomial in $d$).

Even though the conjecture remains unproven in general, it has been
established for some specific indexing schemes [^]. Should the conjecture be
proved, the k-NN algorithm will be affected by the curse of dimensionality
simply because of a theoretical impossibility to retrieve the k nearest
neighbours in time polynomial in the dimension $d$ of the domain $\Omega$
without the need to store a prohibitive amount of data (superpolynomial in
the size $n$ of the actual dataset).

However, it turns out that the exact nearest neighbour search is not
indispensable. Approximate nearest neighbour (ANN) search j^, 0] is known
to admit more efficient indexing schemes than exact NN search, and
approximate nearest neighbours can be substituted in a classifier in place of
exact ones. As our first main result, we propose a new local classification
algorithm based on k approximate nearest neighbours (k-ANN classifier), and
prove its consistency under the assumption of absolute continuity of the data
distribution. Theorem 5.1 is a (partial) extension of the classical Stone
consistency theorem jsf.

At the same time, we observe that in the asymptotic setting of high
dimensions, $d \to \infty$, the k-ANN classifier is affected by what may be
regarded as a variant of Hughes's phenomenon. Namely, the number of
datapoints required to maintain the consistency of the algorithm must
provably grow exponentially with the dimension of the domain, an assumption
that is of course unrealistic. Thus, at least in an artificial theoretical
setting of data sampled randomly from high-dimensional distributions,
switching to the ANN classifier does not lift the curse of dimensionality.

Here comes the second main result of the article which, in spite of its
simplicity, is quite interesting (Theorem 7.1): Stone's theorem is
insensitive to the Euclidean structure on the domain as long as the
underlying Borel structure remains invariant. This allows for a very simple
"Borel isomorphic data reduction" to the one-dimensional case, after which
the k-NN algorithm still remains universally consistent. Moreover, such a
consistent reduction to the one-dimensional real case applies to functional
data classification in infinite-dimensional spaces, in fact in any separable
metric space.

The Borel structure is a subject of study of descriptive set theory [10]. It
is a derivative of the usual topology of the Euclidean or
infinite-dimensional Banach space, and a considerably coarser structure than
the topology, preserving significantly less information. As a result, Borel
isomorphisms between the domains (that is, bijections preserving the Borel
structure, possibly discontinuous at every point) are very numerous and easy
to come by. Every Euclidean space $\mathbb{R}^n$, in fact every
infinite-dimensional Banach or even Fréchet linear space, is Borel-isomorphic
to the real line, and the corresponding isomorphisms can be easily managed at
an algorithmic level and implemented in code. While the Borel structure is
widely used in various parts of pure mathematics, including foundations of
theoretical probability, we are unaware of examples of it being employed for
the purposes of algorithmic data analysis.

The practical significance of this observation still remains to be seen (it 
has only been tested on a few toy datasets from the UCI repository, with 
encouraging results), but on a theoretical level it brings up the problem of 
what is "dimension" in the context of statistical learning. We conclude the 
paper with a small discussion. 

The presentation of results in our paper follows the order in the Intro- 
duction, and is preceded by a reminder of the standard model of statistical 
learning and the Stone consistency theorem. 



2. Stone's theorem 

Here we recall a fundamental result which serves as a theoretical
justification for the k-NN classifier.

The domain is, in the case of main interest for us, a $d$-dimensional
Euclidean space, $\Omega = \mathbb{R}^d$. However, it can be any complete
separable metric space. The Borel structure on $\Omega$ is the smallest
family, $\mathcal{B}$, of subsets of $\Omega$, which contains all open balls
and is closed under countable unions and complements. A function
$f \colon \Omega \to \mathbb{R}$ is Borel measurable if the inverse image of
every interval $(a, b)$ (equivalently, of every Borel subset of the real
line) under $f$ is a Borel subset of $\Omega$. (For a more detailed
discussion, see Subs. 7.1.) A Borel probability measure $\mu$ on $\Omega$ is
a countably-additive function on $\mathcal{B}$ with values in the interval
$[0, 1]$, satisfying $\mu(\Omega) = 1$.

Data pairs $(x, y)$, where $x \in \Omega$ and $y \in \{0, 1\}$, follow an
unknown probability distribution $\mu$ (a Borel probability measure on
$\Omega \times \{0,1\}$). Denote $L(\Omega, \{0, 1\})$ the collection of all
Borel measurable binary functions on the domain. Given such a function
$f \colon \Omega \to \{0, 1\}$ (a classifier), the misclassification error is
defined by

$$\mathrm{err}_\mu(f) = \mu\{(x, y) \in \Omega \times \{0, 1\} \colon f(x) \neq y\}.$$

The Bayes error is the infimal misclassification error over all possible
classifiers:

$$\ell^* = \ell^*(\mu) = \inf_f \mathrm{err}_\mu(f).$$
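For orientation, the infimum is attained. The following is a standard fact recalled here as a worked formula; the notation $\eta(x) = P(Y = 1 \mid X = x)$ for the regression function and $g^*$ for the Bayes classifier is ours at this point of the text.

```latex
g^*(x) =
\begin{cases}
1, & \eta(x) > 1/2, \\
0, & \text{otherwise},
\end{cases}
\qquad
\ell^* = \mathrm{err}_\mu(g^*) = E\bigl[\min\{\eta(X),\, 1 - \eta(X)\}\bigr].
```

In particular $\ell^* \leq 1/2$ always, and $\ell^* = 0$ exactly when the label is almost surely a deterministic function of the point.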
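A learning rule is a family $\mathcal{L} = (\mathcal{L}_n)_{n=1}^{\infty}$,
where

$$\mathcal{L}_n \colon \Omega^n \times \{0,1\}^n \to L(\Omega, \{0,1\}), \quad n = 1, 2, \ldots$$

and the associated evaluation maps

$$\Omega^n \times \{0,1\}^n \times \Omega \ni (\sigma, x) \mapsto \mathcal{L}_n(\sigma)(x) \in \{0,1\}$$

are Borel. Here $\sigma = (x_1, \ldots, x_n, y_1, \ldots, y_n)$ is a labelled
learning sample.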

For example, the k-NN classifier is defined by selecting the value
$\mathcal{L}_n(\sigma)(x)$ in $\{0, 1\}$ by the majority vote among the
values of $y$ corresponding to the $k = k_n$ nearest neighbours of $x$ in the
learning sample $\sigma$. For even k, ties may occur, which are broken with
the help of random orders on the neighbours.
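The rule just described can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the author's implementation: the name `knn_classify` is hypothetical, distance ties are broken by shuffling the sample before a stable sort, and a tied vote (possible for even k) is settled by a fair coin, which is a simplification of the random-order convention mentioned above.

```python
import random
from collections import Counter


def knn_classify(sample, x, k, dist, rng=None):
    """Majority vote among the k nearest neighbours of x.

    sample: list of (point, label) pairs with labels in {0, 1}
    dist:   a metric, called as dist(point, x)
    """
    rng = rng or random.Random()
    shuffled = list(sample)
    rng.shuffle(shuffled)  # random order, so that the stable sort breaks distance ties at random
    neighbours = sorted(shuffled, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    if votes[1] == votes[0]:  # tied vote: assumed fair coin flip
        return rng.choice((0, 1))
    return 1 if votes[1] > votes[0] else 0


sample = [(0.0, 0), (0.1, 0), (0.2, 0), (1.0, 1), (1.1, 1)]
print(knn_classify(sample, 0.05, 3, lambda a, b: abs(a - b)))  # 0
```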

Data is modelled by a sequence of independent identically distributed
random elements $(X_n, Y_n)$ of $\Omega \times \{0,1\}$. Denote $\varsigma$ a
sample path. Then the learning rule $\mathcal{L}_n$ only gets to see the
first $n$ labelled coordinates of $\varsigma$. A learning rule $\mathcal{L}$
is consistent if $\mathrm{err}_\mu \mathcal{L}_n(\varsigma) \to \ell^*$ in
probability as $n \to \infty$. If the convergence occurs almost surely, then
$\mathcal{L}$ is said to be strongly consistent. Finally, $\mathcal{L}$ is
universally consistent if it is consistent under every probability measure
$\mu$. Strong universal consistency is defined in a similar way.

Theorem 2.1 (Stone Q). Let $k = k_n \to \infty$ and $k_n/n \to 0$. Then the
k-NN classification algorithm in $\mathbb{R}^d$ (with regard to the Euclidean
distance) is universally consistent.

The conclusion was subsequently strengthened to strong universal
consistency, cf. Chapter 11 in [11] and historic references.

Stone's theorem fails in more general metric spaces, even in an
infinite-dimensional Hilbert space $\ell^2$. One can construct a
deterministic concept in $\ell^2$ not learned by the k-NN classifier over a
gaussian distribution (cf. an example in [12], pp. 351-352, based on a
construction of Preiss [13]).

An alternative proof of the consistency of the k-NN classifier, based on
the Lebesgue density theorem for the Euclidean space, was given in [13], and
in [12] it was further shown that the k-NN classifier is universally
consistent in every metric space satisfying the Lebesgue-Besicovitch density
theorem. Such metric spaces have been completely characterized by Preiss [15]
(they are the so-called sigma-finite dimensional metric spaces, cf. also
[16]). It would be quite interesting to give a formal proof that the
universal consistency of the k-NN classifier in a metric space is equivalent
to the validity of the Lebesgue-Besicovitch density theorem, and further to
modify the original proof of Stone to make it work for every sigma-finite
dimensional metric space. This would in particular lead to a new proof of the
density theorem of real analysis using tools of statistical learning theory.

Among the factors affecting the performance of the k-NN classifier in
a high-dimensional space, the need to retrieve k nearest neighbours of an
input datapoint in an effective and efficient way is most apparent, and we
will proceed to it now.

3. Exact similarity search 

Let $(\Omega, \rho)$ be a metric space, and $X \subset \Omega$ a finite
subset (dataset). The triple $W = (\Omega, \rho, X)$ is a similarity
workload. The k-nearest neighbour query is: given $q \in \Omega$, return k
nearest neighbours to $q$ in $X$. In practice, it is often reduced by means
of binary search to a sequence of $\varepsilon$-range similarity queries:
given $q \in \Omega$ and $\varepsilon > 0$, return all
$x \in B_\varepsilon(q) \cap X$, where $B_\varepsilon(q)$ denotes the
$\varepsilon$-ball around $q$. See [17].
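The reduction of a k-NN query to range queries can be sketched as follows. This is only an illustration under stated assumptions: the range query is answered by a linear scan (a real access method would use an index), balls are taken closed, and the names `range_query`, `knn_by_range_queries`, the initial radius `r_max` and the tolerance `tol` are hypothetical.

```python
def range_query(X, q, eps, dist):
    """All datapoints within distance eps of q (closed ball, linear scan)."""
    return [x for x in X if dist(q, x) <= eps]


def knn_by_range_queries(X, q, k, dist, r_max, tol=1e-9):
    """Binary search for the smallest radius whose ball holds at least k points."""
    lo, hi = 0.0, r_max
    if len(range_query(X, q, hi, dist)) < k:
        raise ValueError("r_max is too small to capture k datapoints")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if len(range_query(X, q, mid, dist)) >= k:
            hi = mid  # the ball is already large enough, shrink it
        else:
            lo = mid  # too few points, grow the ball
    ball = range_query(X, q, hi, dist)
    return sorted(ball, key=lambda x: dist(q, x))[:k]


X = [0.0, 1.0, 2.0, 5.0, 9.0]
print(knn_by_range_queries(X, 1.2, 2, lambda a, b: abs(a - b), r_max=10.0))  # [1.0, 2.0]
```

Each iteration costs one range query, so the number of range queries is logarithmic in `r_max / tol`.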



An access method for a workload $W$ is an algorithm that correctly answers
every range query. Principal examples of access methods are indexing
schemes, in particular hierarchical tree-based indexing schemes. One popular
version of such a scheme is the M-tree [18]. For varying discussions of
indexing schemes for similarity search in metric spaces, see

The curse of dimensionality for access methods into high-dimensional
domains is a well-known phenomenon among practitioners, even if it is hard
to pinpoint a well-documented reference (see however i^]). At a theoretical
level, the following open "curse of dimensionality conjecture" sums up a
rather commonly held belief in the curse of dimensionality being inherent in
high-dimensional data.

Conjecture 3.1 (cf. H). Let $X \subset \{0,1\}^d$ be a dataset with $n$
points, where the Hamming cube $\{0,1\}^d$ is equipped with the Hamming
($\ell^1$) distance:

$$d(x, y) = \sharp\{i \colon x_i \neq y_i\}.$$

Suppose $d = n^{o(1)}$, but $d = \omega(\log n)$. (That is, the number of
points in $X$ has intermediate growth with regard to the dimension $d$: it is
superpolynomial in $d$, yet subexponential.) Then any data structure for
exact nearest neighbour search in $X$, with $d^{O(1)}$ query time, must use
$n^{\omega(1)}$ space within the cell probe model of computation.



For the cell probe model of computation see 2]J; in the context of indexing
schemes it is briefly discussed in (sj. The best lower bound presently known
for polynomial space data structures is $\Omega(d/\log n)$ [22]. (See also
[23] for some later improvements.) Rigorous lower bounds superpolynomial in
$d$ have been established for a number of concrete indexing schemes, notably
the pivot-based schemes [2J, [5] and the metric trees [25].



4. Approximate similarity search 

The $(c, \varepsilon)$-approximate nearest neighbour search problem 0| is
stated thus: given $\varepsilon$ and $c > 0$, for a $q \in \Omega$, if
$\varepsilon_{NN}(q) < \varepsilon$, then return a datapoint $x \in X$ at a
distance $d(q, x) < (1 + c)\varepsilon$. (Here $\varepsilon_{NN}(q)$ denotes
the distance from $q$ to the nearest neighbour in $X$.)

The known indexing schemes for approximate nearest neighbour (ANN)
search jo], 0, S l^l are more efficient than those for exact NN search.

In order to be used for classification, the $(c, \varepsilon)$-ANN problem
has to be modified in the following way. The $(k, c)$ approximate nearest
neighbours (k-ANN) problem says: given $q$ and $c > 0$, return k datapoints
contained within the distance $(1 + c)\varepsilon_{k\text{-NN}}(q)$ of the
query point. Here $\varepsilon_{k\text{-NN}}(q)$ is the smallest radius of a
ball around $q$ containing k datapoints.

Figure 2: $(c, \varepsilon)$-ANN search.

Figure 3: k-ANN search (labels: query center, k-NNs, k-ANNs returned).
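To make the definition concrete, the sketch below checks whether a proposed answer is a legitimate $(k, c)$-ANN answer. It is only an illustration: the helper names `knn_radius` and `is_valid_k_ann_answer` are ours, datapoints are assumed hashable and pairwise distinct, and the inequality is taken non-strict.

```python
def knn_radius(X, q, k, dist):
    """Smallest radius of a (closed) ball around q containing k datapoints."""
    return sorted(dist(q, x) for x in X)[k - 1]


def is_valid_k_ann_answer(X, q, answer, k, c, dist):
    """True if `answer` consists of k distinct datapoints of X, all lying
    within distance (1 + c) * eps_{k-NN}(q) of the query point q."""
    if len(answer) != k or len(set(answer)) != k:
        return False
    radius = (1 + c) * knn_radius(X, q, k, dist)
    return all(a in X and dist(q, a) <= radius for a in answer)


X = [0.0, 1.0, 2.0, 3.0, 10.0]
d = lambda a, b: abs(a - b)
print(is_valid_k_ann_answer(X, 0.0, [1.0, 2.0], 2, 0.5, d))  # False: 2.0 > 1.5
print(is_valid_k_ann_answer(X, 0.0, [1.0, 2.0], 2, 1.0, d))  # True:  2.0 <= 2.0
```

Note that the exact k nearest neighbours always form a valid answer, and that for larger $c$ many different answers become admissible.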

The indexing schemes based on random projections, random matrices, or
locality sensitive hashing can be adapted to answer the $(k, c)$-ANN query.

As an example, let us consider the scheme originally developed in [3] and
reformulated in [26], Section 7.2 using random binary matrices. The core of
the approach is an indexing scheme into the Hamming cube $\{0,1\}^d$, which
is afterwards converted into an indexing scheme for $\mathbb{R}^d$ by
discretization. Let $0 < \varepsilon < 1$, and denote $X \subset \{0,1\}^d$
the dataset with $n$ points.

Fix a range $\ell = 1, 2, \ldots, d$. The scheme for the range $\ell$
consists of a family of mappings from $\{0,1\}^d$ onto a cube of a smaller
dimension $k = O(\varepsilon^{-2} \lg n)$, with the following property. If
$q \in \{0,1\}^d$ and a random mapping $h$ from this family is chosen (with
regard to a certain probability distribution), then with a constant
confidence $1 - \delta$ the mapping $h$ preserves distances in the set
$X \cup \{q\}$ on the scale $\ell/2$, $\ell = 1, 2, \ldots, d$, to within an
additive error $\ell\varepsilon$, and, on a larger scale, keeps them away
from it. In the scheme under consideration, the map $h$ is a multiplication
on the right by a $d \times k$ matrix with random i.i.d. Bernoulli entries
assuming values 1 and 0 with probabilities $1/\ell$ and $1 - 1/\ell$,
respectively. (The operations are carried mod 2.) The target cube only
contains $2^{O(\varepsilon^{-2} \lg n)} = \mathrm{poly}(n)$ points, and is
indexed to efficiently answer a nearest neighbour query via hashing. The
indexing scheme consists of a sufficiently large family of such functions $h$
for every possible range $\ell$.
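The random map $h$ described above is easy to sketch in pure Python. The sketch covers only the projection step (sampling the matrix and multiplying mod 2), not the hash tables or the choice of parameters; the function names are ours, and the entry probability $1/\ell$ is taken from the description above as an assumption about the intended scheme.

```python
import random


def sample_projection(d, k, ell, rng):
    """Return a d x k binary matrix; each entry is 1 with probability 1/ell."""
    return [[1 if rng.random() < 1.0 / ell else 0 for _ in range(k)]
            for _ in range(d)]


def project(x, A):
    """Multiply the bit vector x (length d) by A on the right, modulo 2."""
    k = len(A[0])
    return tuple(sum(x[i] * A[i][j] for i in range(len(x))) % 2
                 for j in range(k))


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


rng = random.Random(0)
A = sample_projection(d=8, k=5, ell=2, rng=rng)
x = (1, 0, 1, 1, 0, 0, 1, 0)
y = (0, 0, 1, 0, 1, 0, 1, 1)
# h is linear over the two-element field, so the distance between the images
# depends only on the coordinates where x and y differ.
print(hamming(project(x, A), project(y, A)))
```

Because the map is linear mod 2, the image distance between two points is the weight of the image of their coordinatewise XOR; this is the property that the distance-preservation analysis of such schemes exploits.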

If we are now interested in a $(1 + \varepsilon)$-approximate nearest
neighbour query, a binary search in $\ell$ finds the smallest range so that
a randomly chosen $h$ only returns one nearest neighbour to $h(q)$ at a
distance $\ell$. This neighbour is of the form $h(x)$, $x \in X$; the point
$x$ is returned. With confidence $1 - \delta$, this is a
$(1 + \varepsilon)$-approximate nearest neighbour of $q$ in the original
Hamming cube.

If we want to increase the confidence, the algorithm is run repeatedly,
and among the obtained points $x_1, x_2, \ldots$ the nearest one to $q$ is
returned. Building the scheme takes time polynomial in $n$ and the algorithm
answers every query $q$ with high confidence in time
$O(d \,\mathrm{poly} \log(dn))$.

A modification for the k-ANN problem is now obvious. First, a binary
search in $\ell$ determines the smallest value $\ell$ such that for the
corresponding randomly chosen $h$ the $\ell$-ball around $h(q)$ contains at
least k datapoints. Then k nearest neighbours $h(x_1), \ldots, h(x_k)$ to
$h(q)$ in the cube of small dimension (image of $h$) are retrieved, and the
corresponding original points $x_1, x_2, \ldots, x_k$ returned. With a
constant confidence $1 - \delta$, they will be $(k, \varepsilon)$
approximate nearest neighbours of $q$. Again, in order to make the confidence
as high as desired, the procedure is repeated as many times as necessary, all
returned points are put in a bucket, and the k nearest neighbours to $q$
among them are returned. The only change in the indexing scheme is that the
hash table now stores k nearest neighbours instead of one. The running time
of the algorithm is now $O(dk \,\mathrm{poly} \log(dn))$.
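The confidence-amplification step (repeat, collect everything in a bucket, keep the k nearest) can be sketched as follows. The individual runs are assumed to be given as lists of candidate points; how they are produced is outside this sketch, and the name `merge_runs` is hypothetical.

```python
def merge_runs(runs, q, k, dist):
    """Pool the candidates returned by several independent runs and keep
    the k candidates nearest to q in the original metric."""
    bucket = set()
    for candidates in runs:
        bucket.update(candidates)  # duplicates across runs collapse here
    return sorted(bucket, key=lambda x: dist(q, x))[:k]


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


runs = [[(0, 0, 1), (1, 1, 1)], [(0, 0, 1), (0, 1, 1)]]
print(merge_runs(runs, (0, 0, 0), 2, hamming))  # [(0, 0, 1), (0, 1, 1)]
```

The re-ranking uses the true distance to $q$, so a single successful run is enough for the merged answer to be at least as good as that run's answer.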

5. The k approximate nearest neighbour classifier 

This section contains the first of two main new results reported in the
article: an extension of the classical Stone consistency theorem Q to an
approximate nearest neighbour-based classifier.

5.1. Definition and statement of result 

Fix a $c > 0$. The value of the k-ANN classifier (more exactly,
$(k, c)$-ANN classifier) at a point $x$ is determined by the majority vote
among $(k, c)$-approximate nearest neighbours of $x$, as returned by an
indexing scheme.

Theorem 5.1. Suppose the underlying data distribution $\mu$ on
$\mathbb{R}^d \times \{0,1\}$ has density (that is, is absolutely continuous
with regard to the Lebesgue measure). Let $c > 0$ be fixed, and let

$$k \in \omega(\log n) \cap o(n).$$

Then the $(k, c)$-ANN classifier is consistent.

Remark 5.2. Notice that no assumption is made about the nature of the
algorithm for answering $(k, c)$-ANN queries. The task can even be entrusted
to an adversary who is aware of the underlying distribution $\mu$: this will
not affect the consistency, though it may slow down the rate of convergence.

Remark 5.3. The assumption of absolute continuity of the unknown
distribution $\mu$ allows us to avoid dealing with ties in the proof below.
For the moment, we do not know whether this assumption can be dropped.

Remark 5.4. We also do not know how essential the assumption is that k
grows strictly faster than the logarithm of $n$.

5.2. A variation on Stone's theorem

Here is a slightly strengthened version of Stone's theorem ([11], Theorem
6.3).

Theorem 5.5. Let $\mu$ be a probability measure on
$\mathbb{R}^d \times \{0, 1\}$ with regard to which the datapoints are drawn
as i.i.d. random variables. Suppose that
$W_{n,i} = W_{n,i}(x, X_1, X_2, \ldots, X_n)$ are data-dependent weights
(random measurable functions on $\mathbb{R}^d$) which are nonnegative, sum up
to one,

$$\sum_{i=1}^{n} W_{n,i}(x) = 1,$$

and satisfy the properties:

(i) For some $c > 0$ and every Borel subset $A \subset \mathbb{R}^d$,

$$\limsup_{n \to \infty} E\left\{ \sum_{i=1}^{n} W_{n,i}(X) \chi_A(X_i) \right\} \leq c\,\mu(A).$$

(ii) For all $a > 0$,

$$\lim_{n \to \infty} E\left\{ \sum_{i=1}^{n} W_{n,i}(X) \chi_{\{\|X_i - X\| > a\}} \right\} = 0.$$

(iii)

$$\lim_{n \to \infty} E\left\{ \max_{1 \leq i \leq n} W_{n,i}(X) \right\} = 0.$$

Define the classification rule $g_n$ based on the majority vote among all the
values $Y_i$, each given the $W_{n,i}(x)$ share of total vote. If $\mu$ has
density, then the rule $g_n$ is consistent.

Remark 5.6. By approximating a bounded measurable function with simple
functions, the condition (i) is seen to be equivalent to

(i') There is a constant $c > 0$ such that for every bounded measurable
function $f$ on $\mathbb{R}^d$ with values in the interval $[0, 1]$,

$$\limsup_{n \to \infty} E\left\{ \sum_{i=1}^{n} W_{n,i}(X) f(X_i) \right\} \leq c\,E f(X).$$

On the proof of Theorem 5.5. It follows the proof of Stone's Theorem
(Theorem 6.3 as presented in [11] on pages 98-100) practically word for word.

Notice that the conditions (ii) and (iii) are the same; it is only the
condition (i) that has been relaxed. The condition (i) is only used in the
proof once, to obtain the last inequality in the chain of inequalities at the
end of page 99.

The functions $\eta$ and $\eta^*$ in the proof both take their values in the
interval $[0, 1]$: the former is the density of $\mu$ with regard to its
projection on $\mathbb{R}^d$, while the latter is a compactly supported
uniformly continuous approximation to $\eta$ in the $L^2$-norm. Therefore,
the function $(\eta(X) - \eta^*(X))^2$ takes values in $[0, 1]$ as well.
Thanks to (i'), the required inequality holds approximately, to within any
wanted error, if $n$ is large enough. Thus, in the first displayed formula on
top of page 100 one can replace the upper bound of
$3\varepsilon(1 + 1 + c)$ with, for example, $4\varepsilon(1 + 1 + c)$,
provided $n$ is large enough. This will do just as well. The rest of the
proof remains unchanged.

5.3. Proof of Theorem 5.1

We apply Theorem 5.5 with the weights $W_{n,i}$ defined as follows:
$W_{n,i}(x) = 1/k$ if $X_i$ is among the $(k, c)$-approximate nearest
neighbours of $x$ as returned by the oracle (the indexing scheme), and $0$
otherwise. The interpretation of the expected value will depend on whether
the indexing scheme is assumed to be randomized or deterministic; however,
this does not affect the proof.

Clearly, the weights are non-negative. Since for any given $x$ there are
precisely k $(k, c)$-approximate nearest neighbours returned, all but k
weights vanish at the point $x$, and the weights add up to one almost surely.
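The weights just defined, and the weighted majority vote they induce, can be written out as follows. This is a sketch under our own conventions: the oracle's answer is represented by the indices of the returned datapoints, the function names are hypothetical, and a vote share of exactly 1/2 is resolved in favour of label 0 (an assumed convention; ties do not matter for the argument in the text).

```python
def ann_weights(n, returned_indices, k):
    """W_{n,i} = 1/k if the i-th datapoint is among the k returned
    approximate nearest neighbours, and 0 otherwise."""
    returned = set(returned_indices)
    if len(returned) != k:
        raise ValueError("the oracle must return exactly k distinct datapoints")
    return [1.0 / k if i in returned else 0.0 for i in range(n)]


def weighted_vote(weights, labels):
    """Label 1 wins only if it holds more than half of the total vote."""
    share_of_ones = sum(w for w, y in zip(weights, labels) if y == 1)
    return 1 if share_of_ones > 0.5 else 0


labels = [0, 1, 1, 0, 1]
w = ann_weights(len(labels), returned_indices=[1, 2, 3], k=3)
print(w)                         # nonzero exactly at positions 1, 2, 3
print(weighted_vote(w, labels))  # 1: two of the three returned points carry label 1
```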

The condition (ii) follows from a classical observation of Cover and Hart
[27] that in every separable metric space equipped with a Borel probability
measure, the 1-Lipschitz function $\varepsilon_{k\text{-NN}}$ (the smallest
radius of a ball containing k nearest neighbours among $n$ datapoints) will
converge to zero almost surely provided $k/n \to 0$.

The condition (iii) follows from the definition of weights and the
assumption $k \to \infty$.

It remains to verify the condition (i). Denote for $a > 0$ and
$x \in \mathbb{R}^d$

$$\varepsilon_a(x) = \inf\{\varepsilon > 0 \colon \mu(B_\varepsilon(x)) \geq a\}.$$

Now let $f = d\mu/d\lambda$ be the density, that is, the Radon-Nikodym
derivative of the underlying measure on $\mathbb{R}^d$ with regard to the
Lebesgue measure. By the Lebesgue differentiation theorem,

$$\frac{\mu(B_{\varepsilon_a(1+c)}(x))}{\lambda(B_{\varepsilon_a(1+c)}(x))} \cdot \frac{\lambda(B_{\varepsilon_a(1+c)}(x))}{\lambda(B_{\varepsilon_a}(x))} \cdot \frac{\lambda(B_{\varepsilon_a}(x))}{\mu(B_{\varepsilon_a}(x))} \to f(x) \cdot (1 + c)^d \cdot f(x)^{-1} = (1 + c)^d,$$

where the convergence is $\mu$-almost surely and, since it is clearly
dominated, also in probability. For $n$ suitably large,
$\mu(B_{\varepsilon_a(1+c)}(x)) < 2a(1 + c)^d$ for all $x$ except for a set
of measure $\rho = \rho(n) \to 0$ as $n \to \infty$.

Next we apply the uniform Glivenko-Cantelli theorem to estimate the
number of datapoints in $B_{\varepsilon_a(1+c)}(x)$. The VC dimension of the
family of all Euclidean balls in $\mathbb{R}^d$ is $d + 1$ [28]. As a
consequence, for any $\varepsilon > 0$, $\delta > 0$, if

$$n \geq \max\left\{ \frac{8(d+1)}{\varepsilon} \lg \frac{8e}{\varepsilon},\ \frac{4}{\varepsilon} \lg \frac{2}{\delta} \right\},$$

then with confidence $1 - \delta$ the $\mu$-measure and the empirical measure
of every ball differ between themselves by less than $\varepsilon$ (jioj,
Theorem 7.8). Since for $\varepsilon = \log n/n$ the expression on the right
hand side is of the order $n - n \log\log n/\log n < n$, it follows that for
$\varepsilon = \omega(\log n/n)$ and a fixed $\delta > 0$ the conclusion
follows for $n$ sufficiently large. Due to our assumption on k, we can set
$\varepsilon = \Theta(k/n)$ and conclude that, again for $n$ sufficiently
large, with high confidence, as $n \to \infty$, we have

$$|X \cap B_{\varepsilon_{2k/n}(1+c)}(x)| < 6k(1 + c)^d$$

and besides

$$|X \cap B_{\varepsilon_{2k/n}}(x)| \geq k.$$
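To get a feel for the sample-size condition used in the uniform Glivenko-Cantelli step above, the bound can be evaluated numerically. The sketch assumes that the condition reads $n \geq \max\{(8(d+1)/\varepsilon)\lg(8e/\varepsilon),\ (4/\varepsilon)\lg(2/\delta)\}$ and that $\lg$ is the base-2 logarithm; the function name is ours.

```python
import math


def uniform_convergence_sample_size(d, eps, delta):
    """Smallest integer n satisfying the assumed bound for Euclidean balls
    in dimension d (VC dimension d + 1), accuracy eps, confidence 1 - delta."""
    vc_term = 8 * (d + 1) / eps * math.log2(8 * math.e / eps)
    confidence_term = 4 / eps * math.log2(2 / delta)
    return math.ceil(max(vc_term, confidence_term))


for eps in (0.1, 0.01, 0.001):
    print(eps, uniform_convergence_sample_size(d=14, eps=eps, delta=0.05))
```

The required $n$ grows like $(d/\varepsilon)\lg(1/\varepsilon)$, which is why the accuracy $\varepsilon$ cannot be taken smaller than the order of $\log n/n$ for a sample of size $n$.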

Thus, with confidence $1 - \delta$, if $X$ is among k-ANN of $X'$, then the
empirical measure $\mu_n$ of the ball of radius $\|X - X'\|$ centred at $X'$
is at most $6k(1 + c)^d$.

According to Lemma 11.1 of [11], page 171 (Stone's Lemma), the empirical
measure of the set of all such $X'$ is at most
$\gamma_d \cdot 6k(1 + c)^d = 6k(1 + c)^d\gamma_d$, where $\gamma_d$ is an
absolute constant only depending on the dimension $d$. This means that for
all samples $X_1, X_2, \ldots, X_n$ in a set of measure $1 - \delta$ and a
random $X$ independent of the $X_j$'s, the number of points $X_j$ having $X$
as their k-ANN is bounded by

$$6k(1 + c)^d\gamma_d.$$

Denote the set of i.i.d. samples verifying this condition by
$G \subset \Omega^n$. One has $\mu(G) \geq 1 - \delta$. According to the
"confidence is cheap" principle, one can assume here that $\delta \to 0$ with
any rate of convergence subexponential in $n$, for instance, as $1/n$.

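Now we proceed to verifying the condition (i). Let $A \subset \mathbb{R}^d$
be a Borel subset. The quantity that we need to bound can be estimated as
follows:

$$
\begin{aligned}
E\left\{ \sum_{i=1}^{n} W_{n,i}(X)\chi_A(X_i) \right\}
&\leq \frac{1}{k} E\left\{ \sum_{i=1}^{n} \chi_A(X_i)\, \chi_{\{X_i \text{ can be among kANN of } X\}} \right\} \\
&= \frac{1}{k} E\left\{ \chi_A(X) \sum_{i=1}^{n} \chi_{\{X \text{ can be among kANN of } X_i \text{ in } X_1, \ldots, X_{i-1}, X, X_{i+1}, \ldots, X_n\}} \right\}.
\end{aligned}
$$

One has

$$\frac{1}{k} E\left\{ \chi_G\, \chi_A(X) \sum_{i=1}^{n} \chi_{\{X \text{ can be among kANN of } X_i \text{ in } X_1, \ldots, X_{i-1}, X, X_{i+1}, \ldots, X_n\}} \right\} \leq \frac{1}{k} E\left\{ \chi_A\, 6k(1 + c)^d\gamma_d \right\} = 6\mu(A)(1 + c)^d\gamma_d$$

and

$$\frac{1}{k} E\left\{ \chi_{G^c}\, \chi_A(X) \sum_{i=1}^{n} \chi_{\{X \text{ can be among kANN of } X_i \text{ in } X_1, \ldots, X_{i-1}, X, X_{i+1}, \ldots, X_n\}} \right\} \leq \frac{1}{k} \cdot \frac{1}{n} \cdot \mu(A) \cdot n = \frac{\mu(A)}{k}.$$

This finishes the proof. □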



6. Query instability in high dimensions 

The k-ANN classifier has not been tested in practice. However, within a
theoretical model, it is still not free from the curse of dimensionality.
Namely, assuming the datapoints follow a high-dimensional distribution on
$\mathbb{R}^d$ (such as the gaussian), it is not difficult to prove, as a
version of the well-known "empty space paradox," that for a fixed $c > 0$ in
the limit $d \to \infty$ the number of datapoints must grow exponentially in
dimension $d$ in order to maintain the consistency of the algorithm. Indeed,
the ball of radius $(1 + c)\varepsilon_{k\text{-NN}}$ around the query point
will contain $\gg k$ datapoints, so it is conceivable that the labels of k
points chosen among them by the oracle will be highly biased.

Figure 4: Average empirical measure of the balls of radius
$(1 + c)\varepsilon_{k\text{-NN}}$.

Here are two examples. The first one is the Segment dataset of the UCI
data repository, which has a relatively low intrinsic dimension in any
possible sense. The second is a randomly drawn dataset from the gaussian
distribution in dimension 14, whose intrinsic dimension can be described as
medium. The graph of the distribution function of the average number of
nearest neighbours depending on the distance to the query point is shown in
black. Set $k = 20$ and $c = 0.5$. The left vertical line corresponds to the
average value of the k-NN radius $\varepsilon_{k\text{-NN}}$, and the second
line corresponds to $(1 + c)\varepsilon_{k\text{-NN}}$. For the Segment
dataset, the latter ball contains on average 60 datapoints. However, for the
gaussian the corresponding value is already 1742.
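An experiment of this kind is straightforward to set up. The sketch below counts how many datapoints fall into the enlarged ball for i.i.d. standard gaussian data; it is not the author's experiment, and the dataset size, the number of queries, the choice to draw queries from the same distribution, and the function names are all our assumptions, so the numbers it prints need not match those quoted above.

```python
import math
import random


def enlarged_ball_count(data, q, k, c):
    """Number of datapoints within (1 + c) times the k-NN radius of q."""
    dists = sorted(math.dist(q, x) for x in data)
    radius = (1 + c) * dists[k - 1]
    return sum(1 for r in dists if r <= radius)


def average_enlarged_ball_count(d, n, k, c, n_queries, seed=0):
    """Average count over random queries, for standard gaussian data in R^d."""
    rng = random.Random(seed)

    def draw():
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    data = [draw() for _ in range(n)]
    queries = [draw() for _ in range(n_queries)]
    return sum(enlarged_ball_count(data, q, k, c) for q in queries) / n_queries


for d in (2, 14):
    print(d, average_enlarged_ball_count(d=d, n=2000, k=20, c=0.5, n_queries=20))
```

By construction the count is never smaller than k; how far it exceeds k is what reflects the intrinsic dimension of the data.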
This brings us to the following definition. Let us say, following [30], that
a range query $(q, r)$ is $c$-unstable if the
$(1 + c)\varepsilon_{NN}(q)$-ball around $q$ contains at least a half of all
datapoints. (Cf. Figure 5.)
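The definition translates directly into a predicate. In this sketch the function name `is_unstable` is ours, the ball is taken closed, and the nearest-neighbour distance is computed by a linear scan.

```python
def is_unstable(data, q, c, dist):
    """True if the ball of radius (1 + c) * eps_NN(q) around q contains
    at least half of all datapoints."""
    dists = [dist(q, x) for x in data]
    radius = (1 + c) * min(dists)
    inside = sum(1 for r in dists if r <= radius)
    return 2 * inside >= len(data)


data = [1.0, 1.1, 1.2, 5.0]
d = lambda a, b: abs(a - b)
print(is_unstable(data, 0.0, 0.5, d))   # True: three of four points lie within 1.5
print(is_unstable(data, 0.0, 0.05, d))  # False: only the nearest neighbour lies within 1.05
```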




Figure 5: Query instability. 



Under the subexponential data size growth assumption

$$d = \omega(\log n),$$

as well as a certain general assumption of intrinsic high-dimensionality of
the underlying measure distribution [31], [32], one can prove that
asymptotically an overwhelming majority of queries are $(1 + c)$-unstable.
(Cf. theorem 2.1 in jij.) This assumption is met by the gaussian measures on
$\mathbb{R}^d$, the uniform measures on the cubes $\mathbb{I}^d$, the uniform
measures on the Hamming cubes $\{0,1\}^d$, and so forth. In such a situation,
the $(k, c)$-ANN search problem can be essentially answered by returning k
randomly picked datapoints. The k-ANN classifier becomes meaningless, because
the Bayes error approaches 1/2 in the limit $d \to \infty$.

The exponential rate of growth of dataset size $n$ with regard to dimension
$d$ is of course unrealistic. This means that at least in some theoretical
situations (i.i.d. sampling from an artificial high-dimensional distribution)
even allowing for approximate nearest neighbours will not save the k-NN
classifier from the curse of dimensionality.

7. Borel dimensionality reduction



Basic concepts of descriptive set theory [10] offer a new approach to
dimensionality reduction in the context of statistical learning. With this
purpose, let us re-examine the standard setting for (non-parametric)
statistical classification as outlined in Section 2 above.



7.1. Borel sets, mappings, and isomorphisms 

Recall that a family $\mathcal{A}$ of subsets of a set $X$ is a
sigma-algebra if $\mathcal{A}$ contains $X$ and is closed under the
complements and unions of countable subfamilies. A set $X$ equipped with a
sigma-algebra $\mathcal{A}$ of subsets is called a measurable space. Let now
$X$ be a separable metric space. The Borel sigma-algebra of $X$, which we
will denote $\mathcal{B}_X$, is the smallest sigma-algebra of subsets of $X$
containing all open balls. In particular, $\mathcal{B}_X$ contains all open
and all closed subsets of $X$, all intersections of countable families of
open sets ($G_\delta$-sets), all unions of countable families of closed sets
($F_\sigma$-sets), and so on. In fact, Borel subsets are so numerous that it
is not easy to exhibit a constructive example of a non-Borel subset of a
separable metric space such as $\mathbb{R}$.

A mapping $f \colon X \to Y$ between two separable metric spaces is Borel
(or Borel measurable) if the inverse image $f^{-1}(B)$ of each Borel subset
$B$ of $Y$ is Borel in $X$. Equivalently, the inverse image of every open
ball in $Y$ is a Borel subset of $X$. For instance, the indicator function of
the rationals is a Borel function. By changing the values of a Lebesgue
measurable function on a suitable null-set, one can obtain a Borel function.
This stresses how numerous Borel sets and functions are.

A bijective Borel mapping whose inverse mapping is also Borel is called a
Borel isomorphism. It turns out that from the Borel isomorphic viewpoint,
metric spaces do not differ between themselves that much. More precisely,
two complete separable metric spaces $X$ and $Y$ of the same cardinality are
Borel isomorphic. Thus, the Cantor set, the closed unit interval, the
Euclidean space $\mathbb{R}^d$, the infinite-dimensional separable Hilbert
space $\ell^2$, and in fact all separable Fréchet spaces different from the
zero space are pairwise Borel isomorphic. Their Borel structure is that of a
standard Borel space of cardinality continuum.

An example of a Borel isomorphism between the interval [0, 1] and the 
square [0, 1]^ can be obtained by interlacing between themselves the binary 
expansions of x and y of a pair {x, y) e (subject to the usual precautions 
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concerning infinite strings of ones): 

[0, If 3 (O.aiaa . . . , Q-hh . . .) ^ (0.ai6ia2&2 • • •) ^ [0, 1]. (1) 
A geometric representation of this isomorphism can be seen in Figure O 



( 



C 



Figure 6: Constructing a Borel isomorpliism between the square and the interval. 



This of course extends to any number of dimensions. 
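
The digit-interlacing map of Eq. (1) is easy to experiment with. The following
is a minimal Python sketch, not the authors' implementation. It assumes that
every coordinate lies in $[0, 1)$ and truncates it to `n_bits` binary digits so
that the computation is finite; the names `binary_digits`, `interlace` and
`n_bits` are illustrative only. On truncated inputs the map is injective, but
it is only a finite-precision approximation of the Borel isomorphism itself.

```python
from fractions import Fraction


def binary_digits(x, n_bits):
    """Return the first n_bits binary digits of x, which must lie in [0, 1)."""
    x = Fraction(x)
    if not 0 <= x < 1:
        raise ValueError("coordinate must lie in [0, 1)")
    digits = []
    for _ in range(n_bits):
        x *= 2
        bit = 1 if x >= 1 else 0
        digits.append(bit)
        x -= bit
    return digits


def interlace(point, n_bits=16):
    """Interlace the binary digits of the coordinates of `point`.

    For a pair (0.a1a2..., 0.b1b2...) the result is 0.a1b1a2b2..., truncated
    to n_bits digits per coordinate and returned as an exact Fraction.
    """
    rows = [binary_digits(c, n_bits) for c in point]
    code = 0
    for j in range(n_bits):  # take the j-th digit of every coordinate in turn
        for row in rows:
            code = (code << 1) | row[j]
    return Fraction(code, 2 ** (n_bits * len(point)))


# (0.10, 0.01) in binary is interlaced to 0.1001 in binary, that is, 9/16.
print(interlace((0.5, 0.25), n_bits=2))
```

Exact rationals are used here because the output carries `n_bits` digits for
each of the coordinates, which quickly exceeds the 53-bit significand of a
double-precision float as the dimension grows.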

Usually the mappings performing the data reduction of the domain are 
assumed to be continuous, even Lipschitz. However, if one looks at the 
existing theoretical model laying down a foundation for statistical learning, 
one can notice that Stone's theorem is in fact insensitive to the Euclidean 
structure (that is, either metric or topological structure) on the domain as 
long as the underlying Borel structure remains intact. This allows for a very 
simple "Borel isomorphic data reduction" to the one-dimensional case, after 
which the algorithm still remains universally consistent. 

7.2. Borel isomorphic dimension reduction 

The following result, although straightforward, offers, in our opinion, a 
potentially interesting new approach to dimensionality reduction in statistical 
learning theory. We consider it as the central result of the work reported. 

Recall that a standard Borel space is a complete separable metric space 
equipped with its Borel structure. 

Theorem 7.1. Let $\Omega$ be a standard Borel space. Fix a Borel isomorphism
$\phi\colon \Omega \to X$, where $X = (X, d_X)$ is a metric space in which the
k-NN learning rule is universally consistent (for instance, $\mathbb{R}^d$ or
its metric subspace). Define a metric $\rho$ on $\Omega$ by

$$\rho(x, y) = d_X(\phi(x), \phi(y)).$$

Then the learning rule on $\Omega$ given by the k-NN rule with regard to the
metric $\rho$ is universally consistent.
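
The learning rule in Theorem 7.1 amounts to running the usual k-NN vote with
distances measured between images under the chosen isomorphism. Below is a
minimal Python sketch under stated assumptions: the function name
`knn_predict`, the toy sample, and the choice of the map $t \mapsto t^3$ on the
real line (a homeomorphism, hence a Borel isomorphism) are ours and serve only
to show the mechanics.

```python
from collections import Counter


def knn_predict(sample, query, k, phi, d_X):
    """Predict the label of `query` by majority vote among its k nearest
    neighbours with respect to rho(x, y) = d_X(phi(x), phi(y)).

    `sample` is a list of (point, label) pairs in the original domain.
    """
    image = phi(query)
    ranked = sorted(sample, key=lambda pair: d_X(phi(pair[0]), image))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def phi(t):
    """Hypothetical Borel isomorphism of the real line onto itself."""
    return t ** 3


def d_X(u, v):
    """Usual distance on the real line."""
    return abs(u - v)


sample = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
print(knn_predict(sample, 0.8, 3, phi, d_X))   # neighbours 0.5, 1.0, -0.5
print(knn_predict(sample, -0.8, 3, phi, d_X))  # neighbours -0.5, -1.0, 0.5
```

Any other pair `phi`, `d_X` can be plugged in, for example a digit-interlacing
map into the unit interval with the usual distance.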

Before proving the result, we need to fix notation and terminology. 

A metric space with measure, or an mm-space, is a triple $(\Omega, \rho, \mu)$
consisting of a separable metric space $(\Omega, \rho)$ and a Borel probability
measure $\mu$ on this space. This is an important notion in modern geometry and
functional analysis [33].

The basic object of classification theory will be very similar, the only
difference being that the probability measure $\mu$ is now defined on
$\Omega \times \{0, 1\}$. Equivalently, such an object can be described as a
metric space $(\Omega, \rho)$ equipped with a pair of finite measures
$\mu_0, \mu_1$ whose total masses add up to one: $\mu_0$ is the restriction of
$\mu$ to $\Omega \times \{0\}$, and $\mu_1$ to $\Omega \times \{1\}$. Let us
call such objects mm2-spaces.

Two metric spaces with measure $(\Omega_i, \rho_i, \mu_i)$, $i = 1, 2$, are
isomorphic if there is an isomorphism between them, that is, a mapping
$\phi\colon \Omega_1 \to \Omega_2$ which is an isomorphism of measure spaces
and which preserves the metric almost everywhere. The concept of an isomorphism
between our mm2-spaces is defined similarly: it is a measurable mapping which
preserves $\mu_0$, $\mu_1$, and which preserves pairwise distances between
points $(\mu_0 + \mu_1)$-almost everywhere.

Let $(\Omega, \rho, \mu_0, \mu_1)$ and $(\Omega', \rho', \mu_0', \mu_1')$ be
two isomorphic mm2-spaces, with an isomorphism $\phi\colon \Omega \to \Omega'$.
Let $\mathcal{L}$ be a learning rule in $\Omega$. Then one can define a
learning rule $\mathcal{L}^{\phi}$ in the space
$(\Omega', \rho', \mu_0', \mu_1')$ using the isomorphism, as follows. Denote by
$\phi^{-1}$ the inverse measurable isomorphism to $\phi$. We will denote by the
same symbol $\phi^{-1}(s)$ the image of a labelled $n$-sample
$s = (x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n)$, that is,

$$\phi^{-1}(s) = (\phi^{-1}(x_1), \phi^{-1}(x_2), \ldots, \phi^{-1}(x_n), y_1, y_2, \ldots, y_n).$$

Now set

$$\mathcal{L}^{\phi}(s)(x) = \mathcal{L}(\phi^{-1}(s))(\phi^{-1}(x)).$$

This is a learning rule on $\Omega'$, a transport of the rule $\mathcal{L}$
along the map $\phi$. The following should now be obvious.

Lemma 7.2. The learning rule $\mathcal{L}^{\phi}$ in $\Omega'$ has the same
learning error as $\mathcal{L}$ does in $\Omega$. □
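
The transport of a learning rule along an isomorphism can be written as a small
higher-order function: pull the labelled sample and the query point back
through the inverse map, then apply the original rule. The Python sketch below
illustrates this under assumptions of ours: a learning rule is modelled as a
function taking a list of (point, label) pairs and returning a classifier, and
the names `transport`, `one_nn_rule`, `phi` and `phi_inv` are hypothetical.

```python
def transport(rule, phi_inv):
    """Transport a learning rule along phi, given the inverse map phi_inv.

    `rule(sample)` must return a classifier (a function point -> label),
    where `sample` is a list of (point, label) pairs.
    """
    def transported_rule(sample):
        pulled_back = [(phi_inv(x), y) for x, y in sample]
        classifier = rule(pulled_back)
        return lambda x: classifier(phi_inv(x))
    return transported_rule


def one_nn_rule(sample):
    """1-NN rule on the real line with the usual distance."""
    def classify(x):
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return classify


def phi(t):
    """Hypothetical isomorphism of the real line: t -> t**3."""
    return t ** 3


def phi_inv(u):
    """Inverse of phi: the real cube root."""
    return abs(u) ** (1.0 / 3.0) * (1 if u >= 0 else -1)


sample = [(-1.0, 0), (0.5, 1), (2.0, 1)]
image_sample = [(phi(x), y) for x, y in sample]
original = one_nn_rule(sample)
moved = transport(one_nn_rule, phi_inv)(image_sample)

# The transported rule makes the same predictions on corresponding points.
agree = all(original(x) == moved(phi(x)) for x in (-3.0, -0.4, 0.1, 1.7))
print(agree)
```

The check at the end is the finite-sample counterpart of Lemma 7.2: predictions
on corresponding points coincide, so the errors coincide as well.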



Proof of Theorem 7.1. It is enough to notice that the k-NN classifier
in the metric space $(\Omega, \rho)$ is the transport along $\phi$ of the k-NN
classifier in the metric space $(X, d_X)$. For every probability distribution
$\mu$ on $\Omega \times \{0, 1\}$, denote by $\phi_*\mu$ the push-forward of
$\mu$ along $\phi$. This is a Borel probability distribution on
$X \times \{0, 1\}$, and clearly the Bayes error of $\mu$ equals that of
$\phi_*\mu$. In view of our hypothesis of the universal consistency of the k-NN
classifier in $X$, the learning error of the classifier equals the Bayes error.
Due to Lemma 7.2, the same conclusion holds for the k-NN classifier in the
metric space $(\Omega, \rho)$.

In particular, there is always a Borel isomorphic reduction of the problem
in $\mathbb{R}^d$ to $d = 1$. Moreover, the reduction even applies to
functional data learning in the most general situation imaginable, when
$\Omega$ is an arbitrary separable metric space, for instance a separable
Banach space, or even a separable Fréchet space.

Example 7.3. The histogram learning rule in the cube $\mathbb{I}^d$ is a Borel
isomorphic reduction to the Cantor set (a zero-dimensional compact metrizable
space without isolated points) equipped with a non-Archimedean metric.

Example 7.4. The distance metric learning methods, such as LMNN (see
e.g. [34]), are based on selecting an alternative Euclidean metric in
$\mathbb{R}^d$. This is equivalent to selecting a linear isomorphism $\phi$
from $\mathbb{R}^d$ to itself and using the transported learning rule in the
original space. Here, $\phi$ is of course not just a Borel isomorphism, but
moreover a homeomorphism.
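
The equivalence in Example 7.4 can be made explicit with a short computation.
The following is a sketch under the assumption that the learned metric is of
Mahalanobis type, given by a symmetric positive definite matrix $M$; the
symbols $M$ and $A$ are introduced here for illustration and do not appear in
the source.

```latex
\[
  d_M(x, y)^2 = (x - y)^{\top} M \,(x - y)
              = (x - y)^{\top} A^{\top} A \,(x - y)
              = \lVert A x - A y \rVert_2^2 ,
  \qquad M = A^{\top} A ,
\]
```

where $A$ is an invertible matrix (for instance, a Cholesky factor of $M$).
Hence the k-NN rule with respect to $d_M$ coincides with the Euclidean k-NN
rule applied after the linear isomorphism $x \mapsto A x$.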



8. Discussion 

In this article, we suggested two novel approaches to dimensionality
reduction in the context of the k-NN classification: the k-approximate nearest
neighbour rule and the Borel isomorphic dimensionality reduction. The closest
counterpart in the literature is an approach based on a combination of the
k-NN classifier with random projections [35], [36]. Notice, however, that this
is different from our approaches: first, not every indexing scheme for ANN
search is based on random projections, and second, projections are not Borel
isomorphisms.

                         Iris            Diabetes        Ionosphere      Balance-scale
Dimension reduction      4 → 1           8 → 1           33 → 1          4 → 1
Number of instances      150             768             351             625
                         before  after   before  after   before  after   before  after
Correctly classified     144     140     569     561     316     300     562     407
Same, %                  96.0    93.3    74.1    73.0    90.0    85.5    89.9    65.1
Incorrectly classified   6       10      199     207     35      51      63      218
Same, %                  4.0     6.67    25.9    27.0    10.0    14.5    10.08   34.9
Optimal value of k       6       3       17      10      2       8       9       18

Table 1: k-NN classification of some datasets in the UCI repository before and
after a Borel dimensionality reduction as in Eq. (1), using RWeka
($1 \le k \le 20$, 10-fold cross-validation).

A more technical paper about the approximate k nearest neighbour classifier,
in which the algorithm has been implemented and tested, is currently in
preparation [37]. The Borel isomorphic dimensionality reduction is easy to
implement, and we invite the readers to try their hand at it. As a word of
caution, if the original domain is high dimensional, this may necessitate
using floating-point arithmetic.

The initial experiments with data from the UCI machine learning repository
(Table 1) show that the Borel isomorphic reduction succeeds at least for some
datasets, and fails for others. On the one hand, the richness of the class of
Borel isomorphisms is enormous, and the failure can always be attributed to a
poor choice of a reduction. On the other hand, it is definitely hard to expect
such a simple idea to give a panacea for the curse of dimensionality. What is
probably a realistic expectation is the possibility of slashing a high
dimension by a given factor without degrading the performance, by arranging
the dimensions of a domain in groups of $m \ll n$ and performing a Borel
isomorphic reduction $\mathbb{R}^m \to \mathbb{R}$ on every such group
separately. An interesting perspective is a theory of capacity for families of
Borel isomorphisms between a given domain and a fixed lower-dimensional space
(e.g. $\mathbb{R}$), enabling search for an optimal Borel data reduction for a
given dataset.
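
The group-wise reduction suggested above can be sketched in a few lines of
Python. This is an illustration under assumptions of ours, not a tested
recipe: coordinates are assumed to be rescaled to $[0, 1)$, digits are
truncated to `n_bits` per coordinate, consecutive coordinates are grouped
together, and the names `interlace`, `reduce_in_groups` and `group_size` are
hypothetical.

```python
from fractions import Fraction


def interlace(point, n_bits=16):
    """Interlace the first n_bits binary digits of coordinates in [0, 1)."""
    rest = [Fraction(c) for c in point]
    code = 0
    for _ in range(n_bits):
        for i, x in enumerate(rest):
            x *= 2
            bit = 1 if x >= 1 else 0
            code = (code << 1) | bit
            rest[i] = x - bit
    return Fraction(code, 2 ** (n_bits * len(point)))


def reduce_in_groups(point, group_size, n_bits=16):
    """Split the coordinates into consecutive groups and reduce each group
    to a single number, so n coordinates become ceil(n / group_size)."""
    groups = [point[i:i + group_size] for i in range(0, len(point), group_size)]
    return [interlace(g, n_bits) for g in groups]


# Six coordinates reduced to three by interlacing them in pairs.
x = (0.5, 0.25, 0.75, 0.0, 0.5, 0.5)
print(reduce_in_groups(x, group_size=2, n_bits=2))
```

How the coordinates are assigned to groups is itself a choice of Borel
isomorphism, so different groupings may behave very differently on a given
dataset.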

At a theoretical level, this brings up the question: what is dimension in
the context of statistical learning? It took mathematicians roughly half a
century to isolate the correct notion of dimension of a topological space and
obtain the basic results of the theory (roughly, 1873-1921, see [38]). Will it
take the same amount of time to put forward a satisfactory theory of intrinsic
dimension of data in the context of statistical learning, in order to explain
away the curse of dimensionality?

What our investigation demonstrates is that such a notion should reflect
not the dimension of the domain per se, but rather the complexity of the
target classifier in the setting of an mm2-space consisting of a metric domain
$\Omega$ and a probability distribution on $\Omega \times \{0, 1\}$. An
important factor is the isoperimetric behaviour of the unknown target concept
$C$, by which we mean the rate of growth of the function

$$\varepsilon \mapsto \mu(C_\varepsilon),$$

where $C_\varepsilon$ is the $\varepsilon$-neighbourhood of the concept. A fast
growing isoperimetric function reflects the high complexity of the margin and a
consequent difficulty of finding a classifier.